ABSTRACT

1. Introduction

Various  iterative  methods,  and  multigrid  in  particular,  are  often  used  to  solve  the  large  linear  systems 
that  arise  from  the  solution  of  elliptic  partial  differential  equations.  This  report  describes  an  experimental 
study  undertaken  to  investigate  the  use  of  the  Crystal  distributed  computing  facility  [1]  to  implement  two  of 
these methods. The two methods chosen for study are the red/black Successive Over-relaxation (SOR)
method and the MGR[v] multigrid algorithm.

Red/black SOR can be completely distributed over a number of machines and the algorithm itself is very
easy to implement, hence it is a good test algorithm. However $\rho$, the rate of convergence (contraction
number) for SOR, is

$$\rho \approx 1 - ch$$

where c is some constant independent of h. This rate becomes abysmally slow as $h \to 0$. This slow rate
suggests the use of much faster, although significantly more complicated, algorithms such as multigrid.

The MGR[v] multigrid algorithm presents an added challenge beyond the details of the SOR algorithm.
On the one hand multigrid algorithms have rates of convergence which are bounded away from one by
constants independent of h. In fact, for Poisson's equation in a square the asymptotic rate of convergence $\rho(v)$
of the two grid MGR[v] algorithm satisfies

l,(l  )  '  2  |2  j-‘l  -1  as  h~°  ■ 

This is clearly superior to the rate of convergence for the SOR algorithm. On the other hand achieving this
rate of convergence requires that much more work be performed. Indeed, from a distributed computing point
of view there is a stage of the multigrid algorithm which is performed sequentially.

One can reasonably ask: Given the Crystal distributed computing facility, can the completely distributable
SOR algorithm be made faster than the MGR[v] algorithm? Before answering this question decisions
need to be made about the implementation of both of these algorithms.

This report has the following organization: Section two describes the specific differential equation and
resulting linear system that was actually solved. Section three describes the Crystal multicomputer and the
modifications to the sequential algorithms that are necessary for their implementation. Sections four and five
present the SOR and multigrid methods, respectively.

2. The Problem


The following analytic problem was chosen for testing the two methods of solution.

Find $u(x, y)$ such that

$$-\frac{\partial^2 u}{\partial x^2} - \frac{\partial^2 u}{\partial y^2} = 0 \quad \text{for } (x, y) \in \Omega, \tag{2.1}$$

$$u(x, y) = 0 \quad \text{for } (x, y) \in \partial\Omega$$

and

$$\Omega = \{ (x, y) \in \mathbb{R}^2 \mid 0 < x, y < 1 \}.$$

Although the answer to this problem is obvious, u = 0, when studying iterative methods for the general
Poisson equation with Dirichlet boundary conditions this homogeneous problem is the only problem which needs
to be considered, since the error of each iterate for the general problem satisfies the same iteration applied
to the homogeneous problem.

In general terms the method used to solve (2.1) is to discretize the region $\Omega$ into $\Omega_h$ and then to use
either SOR or multigrid to solve the resulting linear system of equations.

The first step is to define $\Omega_h$, the grid used to approximate $\Omega$. $\Omega_h$ is constructed by first choosing N
(N odd), the number of points on a side of $\Omega$, and setting h, the mesh width, equal to $1/(N+1)$. Then

$$\Omega_h = \{ (x_i, y_j) \in \Omega \mid x_i = ih,\ y_j = jh,\ 1 \le i, j \le N \}.$$

For reasons that will become clearer later, $\Omega_h$ is divided into two subsets $\Omega_h^R$ and $\Omega_h^B$ (R corresponds to 'red'
points and B corresponds to 'black' points). $\Omega_h^R$ and $\Omega_h^B$ are defined by

$$\Omega_h^R = \{ (x_i, y_j) \in \Omega_h \mid i + j \equiv 1 \bmod 2 \}$$

and

$$\Omega_h^B = \{ (x_i, y_j) \in \Omega_h \mid i + j \equiv 0 \bmod 2 \}.$$

The linear system that arises from discretizing (2.1) is, for all $(x_i, y_j) \in \Omega_h$,

$$\frac{1}{h^2} \left[ -U_{i+1,j} - U_{i,j+1} - U_{i,j-1} - U_{i-1,j} + 4U_{i,j} \right] = 0 \tag{2.2}$$

with

$$U_{i,j} = 0 \quad \text{for } i = 0,\ N+1 \ \text{ or } \ j = 0,\ N+1.$$

The notation $U_{i,j}$ corresponds to $U(x_i, y_j)$, etc. Since the right hand side of equation (2.2) is zero the $1/h^2$
term is neglected. The system of linear equations (2.2) is written in matrix form as AU = 0, where U is the
vector of unknowns corresponding to points in $\Omega_h$. The solution to this system of equations, U, gives an
approximation to u(x,y), the solution of (2.1), which is an $O(h^2)$ approximation.
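The grid definitions above can be illustrated with a minimal Python sketch. It builds the red and black index sets for a small grid and applies the five point operator of (2.2) with the $1/h^2$ factor dropped. The function names `red_black_sets` and `apply_operator`, and the storage of the grid as a list of lists surrounded by a ring of zero boundary values, are assumptions made for this sketch and are not taken from the original implementation.

```python
def red_black_sets(n):
    """Return (red, black) lists of index pairs (i, j) with 1 <= i, j <= n.

    Red points have i + j odd, black points have i + j even.
    """
    red, black = [], []
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            (red if (i + j) % 2 == 1 else black).append((i, j))
    return red, black


def apply_operator(u, n):
    """Apply the five point operator of (2.2), without the 1/h^2 factor.

    u is an (n + 2) x (n + 2) list of lists whose outer ring holds the
    zero Dirichlet boundary values.
    """
    out = [[0.0] * (n + 2) for _ in range(n + 2)]
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            out[i][j] = (4.0 * u[i][j] - u[i + 1][j] - u[i - 1][j]
                         - u[i][j + 1] - u[i][j - 1])
    return out


n = 7
red, black = red_black_sets(n)
print(len(red), len(black))  # 24 red points and 25 black points for n = 7
```

Every neighbor of a red point is black and vice versa, which is the property the later sections rely on.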


3.  Brief  overview  of  Crystal 


The Crystal multicomputer is composed of a network of VAX 750 computers, each of which can
communicate with every other. Communication between machines is accomplished by means of messages. This
report does not describe the technical details of the message transfers; [1] and [2] describe these details.

However,  from  an  algorithm  design  viewpoint  a  number  of  details  are  important.  Each  message  can 
consist  of  up  to  512  words  (2048  bytes)  of  information.  Once  a  machine  sends  a  message  it  is  free  to  proceed 
with new work. On the receiving end, the messages are placed in a buffer until they are read by the
receiving machine. If no message has arrived when the receiving machine is ready for one then it must wait for a
message to arrive. The ratio of transmission time of messages to computational time (multiplications etc.) is
large, thus it is to an algorithm's advantage to do as many computations as possible between sending a
message and waiting for a response. Otherwise the machine must busy wait for a message to arrive.

Connected  to  the  Crystal  multicomputer  are  a  number  of  VAX  780  computers,  referred  to  as  hosts. 
The  individual  machines  in  the  network  of  VAX  750  computers  are  referred  to  as  node  machines.  To  run  a 
particular  experiment  the  experimenter  is  able  to  communicate  with  the  node  machines  from  the  host 
machines.  For  example,  in  this  application  the  problem  size  and  other  parameters  are  sent  to  the  node 
machines which then proceed to solve the problem, either by the red/black SOR algorithm or by the MGR[v]
algorithm.  One  particular  node  machine,  for  this  application,  is  called  the  master  node  and  the  other  nodes 
are  called  slave  nodes.  The  master  node  communicates  with  the  host  machines  and  takes  care  of  various 
bookkeeping  tasks,  such  as  timing  the  algorithm. 

3.1.  Implementation  on  Crystal 

The basic idea used to solve the discrete problem (2.2) on Crystal is to decompose the region $\Omega_h$ into a
number of smaller regions $\Omega_h^k$ and to assign each region $\Omega_h^k$ to a different processor k. The decomposition of
$\Omega_h$ is accomplished by choosing the number of machines to use, M, and then splitting $\Omega_h$ into horizontal
strips each of height 1/M. Thus, $\Omega_h = \Omega_h^1 \cup \Omega_h^2 \cup \cdots \cup \Omega_h^M$ where

$$\Omega_h^k = \left\{ (x_i, y_j) \in \Omega_h \;\middle|\; \frac{k-1}{M} \le jh < \frac{k}{M} \right\}. \tag{3.1}$$


Figure 1 shows, for N = 7 and M = 3, $\Omega_h^k$ corresponding to machine k for k = 1, 2, 3.

[Figure 1 panel labels: machine 1, machine 2, machine 3]

Note that the number of points per processor does not have to be the same. In addition, in our
implementation an odd number of machines is required to ensure that inter-machine boundaries do not lie on grid points.
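The strip decomposition can be sketched in a few lines of Python. The sketch assumes that row j, at height jh with h = 1/(N + 1), belongs to strip k when (k - 1)/M <= jh < k/M; the function name `rows_per_machine` and the integer formula `(j * m) // (n + 1)` for the 0-based strip index are choices made for this illustration only.

```python
def rows_per_machine(n, m):
    """Count the grid rows j (1 <= j <= n) that fall in each of m strips.

    Row j sits at height j * h with h = 1 / (n + 1); strip k (0-based)
    covers heights from k / m up to (k + 1) / m. Integer arithmetic
    avoids floating point trouble near the strip edges.
    """
    counts = [0] * m
    for j in range(1, n + 1):
        counts[(j * m) // (n + 1)] += 1
    return counts


for m in (1, 3, 5, 7, 9, 11):
    counts = rows_per_machine(15, m)
    print(m, counts, max(counts))
```

Under this assumption the largest strip for N = 15 has 15, 5, 3, 3, 2 and 2 rows for M = 1, 3, 5, 7, 9 and 11, which agrees with the maximum number of rows reported in the experimental tables.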

3.2.  Data  Flow 

One can apply the concept of data flow, see [3] for example, to study our particular implementation of
each  iteration  of  the  red/black  SOR  algorithm  on  Crystal.  The  data  flow  concept  involves  examining  what 
unknowns  are  actually  being  determined  and  what  is  the  flow  of  data  through  the  processors. 

For our problem it is important to realize that the only true unknowns are those points lying on
machine boundaries. Once these points are fixed then the points in the interior of $\Omega_h^k$, for each iteration, are
determined by the iteration process itself. In the eventual solution of (2.2) these interior points are
determined by the uniqueness of the solution, which is implied by the maximum principle. Therefore the object
of our parallel algorithm is to determine the values for these points. From this point of view the role of the
points lying inside each $\Omega_h^k$ is solely to update these boundary values.

The task of each machine is thus to input boundary data and then to output updated values. Each
machine proceeds independently and can be viewed as a single instruction: update boundary values. These
computational processes are triggered solely by the flow of data through the network. There is no global
synchronization or control over the algorithm with the exception of starting and stopping.


Figure 1 - $\Omega_h^k$

An important advantage in looking at the algorithm in this light is that the particular algorithm is
independent of the architecture it is programmed on. For example the only change necessary in moving from
the distributed Crystal network to a shared memory machine is that rather than physically transferring
boundary values one would only have to mark the boundary rows as being free for the next machine to update.

Of course, for each machine the interior points serve as an initial guess for the next iteration. This has
the practical implication that after the first iteration the machines are no longer interchangeable over $\Omega_h$.


4.  The  SOR  Method 


The successive over-relaxation, SOR, method has been studied by many authors, see [4,5,6]. The
red/black SOR method is characterized by its use of the decomposition of $\Omega_h$ into $\Omega_h^R$ and $\Omega_h^B$. Although
red/black line SOR (SLOR) also conveniently lends itself to the Crystal architecture, red/black point SOR is
used instead since the decomposition of $\Omega_h$ into $\Omega_h^R$ and $\Omega_h^B$ is also necessary for the MGR[v] algorithm
described in section five. This allows us to compare two methods for the solution of (2.2) based on the same
basic iterative scheme. In [7] the red/black splitting is applied to vector processors.

The iterative scheme for our problem, for a given value of $\omega$ and initial guess $U^0$, is:

Repeat for each k, k = 1, 2, 3, ... until convergence:

First update points in $\Omega_h^R$:

For all $(x_i, y_j) \in \Omega_h^R$, set

$$U_{i,j}^{k+1} := \omega \left[ U_{i+1,j}^{k} + U_{i,j+1}^{k} + U_{i,j-1}^{k} + U_{i-1,j}^{k} \right] / 4 + (1 - \omega) U_{i,j}^{k}.$$

Then update points in $\Omega_h^B$:

For all $(x_i, y_j) \in \Omega_h^B$, set

$$U_{i,j}^{k+1} := \omega \left[ U_{i+1,j}^{k+1} + U_{i,j+1}^{k+1} + U_{i,j-1}^{k+1} + U_{i-1,j}^{k+1} \right] / 4 + (1 - \omega) U_{i,j}^{k}.$$

Note that while updating points in $\Omega_h^R$, $U_{i+1,j}^k$, $U_{i,j+1}^k$, $U_{i,j-1}^k$ and $U_{i-1,j}^k$ are all in $\Omega_h^B$; similarly
the reverse is true while updating points in $\Omega_h^B$.
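A serial version of this scheme can be sketched in Python as follows. The sketch assumes that the red points (i + j odd) are relaxed first and the black points second, stores the grid as a list of lists with a ring of zero boundary values, and uses $\omega = 1.5$ purely as an arbitrary admissible value. The names `red_black_sor_sweep` and `max_norm` are hypothetical.

```python
def red_black_sor_sweep(u, n, omega):
    """One red/black SOR iteration for the homogeneous problem (2.2).

    u is an (n + 2) x (n + 2) list of lists with zero boundary values.
    Red points (i + j odd) are updated first, then black points. The
    update is done in place: while one color is relaxed, all four
    neighbors have the other color and are therefore unchanged.
    """
    for parity in (1, 0):  # 1 selects red points, 0 selects black points
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                if (i + j) % 2 == parity:
                    avg = (u[i + 1][j] + u[i - 1][j]
                           + u[i][j + 1] + u[i][j - 1]) / 4.0
                    u[i][j] = omega * avg + (1.0 - omega) * u[i][j]


def max_norm(u, n):
    """Largest absolute value over the interior points."""
    return max(abs(u[i][j]) for i in range(1, n + 1) for j in range(1, n + 1))


n = 15
omega = 1.5  # arbitrary value in (0, 2), not the optimal one
u = [[0.0] * (n + 2) for _ in range(n + 2)]
for i in range(1, n + 1):
    for j in range(1, n + 1):
        u[i][j] = 1.0  # arbitrary initial guess; the exact solution is 0

for _ in range(200):
    red_black_sor_sweep(u, n, omega)
print(max_norm(u, n))  # the iterate is itself the error, since the solution is 0
```

Because the exact solution of the homogeneous problem is zero, the norm of the iterate measures the error directly, which is what makes this test problem convenient.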


The algorithm is completely described by the above description except for the choice of $\omega$. For
$0 < \omega < 2$ the SOR algorithm converges [4]. In addition, for our problem the optimal $\omega$ is given by

$$\omega_{optimal} = \frac{2}{1 + \sqrt{1 - \mu^2}}$$

where $\mu$ is the largest eigenvalue of the Jacobi iteration matrix. For general problems the value of $\mu$ is not
known and an iterative procedure must be used to determine $\omega_{optimal}$. However, in this case it is known that

$$\mu = \cos \pi h.$$

In addition, for $\omega_{optimal}$ the rate of convergence of the red/black SOR method is

$$\rho = \frac{1 - \sin \pi h}{1 + \sin \pi h}.$$
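To give a feeling for these quantities, the following Python sketch evaluates the optimal relaxation parameter and the corresponding rate for the problem sizes used in the experiments, assuming h = 1/(N + 1). The function name `sor_parameters` is hypothetical, and the iteration count is only a rough estimate based on the rate alone.

```python
import math


def sor_parameters(n):
    """Return (omega_opt, rho) for the model problem with n points per side."""
    h = 1.0 / (n + 1)
    mu = math.cos(math.pi * h)  # largest eigenvalue of the Jacobi iteration matrix
    omega_opt = 2.0 / (1.0 + math.sqrt(1.0 - mu * mu))
    rho = omega_opt - 1.0  # algebraically equal to (1 - sin(pi h)) / (1 + sin(pi h))
    return omega_opt, rho


for n in (15, 31, 63, 127):
    omega_opt, rho = sor_parameters(n)
    # Rough number of iterations for a 10^-6 reduction, using rho alone.
    iterations = math.ceil(math.log(1e-6) / math.log(rho))
    print(n, round(omega_opt, 4), round(rho, 4), iterations)
```

For small h the rate behaves like $1 - 2\pi h$, so the estimated iteration count roughly doubles each time N is doubled.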
4.1. Implementation on Crystal

We have already seen that the basic idea used to exploit the Crystal architecture is the concept of
domain decomposition. What is the interaction between machines in terms of the SOR iteration? What is the
best reorganization of the algorithm to minimize the effects of the decomposition?

Each point $U_{i,j}$ that is updated using the SOR algorithm depends only upon its four nearest neighbors.
For points interior to a given region $\Omega_h^k$ these four neighbors lie completely within $\Omega_h^k$. However for
boundary points of $\Omega_h^k$ (either the top or the bottom row) one of the neighbors lies in either $\Omega_h^{k-1}$ (points along
the top of $\Omega_h^k$) or in $\Omega_h^{k+1}$ (points along the bottom). Before each iteration machine k must therefore receive
these boundary values from machines k - 1 and k + 1. Similarly machine k must send its boundary values to
its two neighboring machines. Machines 1 and M have only one neighboring machine each, therefore there
is only one boundary transfer for these two machines.

In order to "hide" the message transfers, rather than sending boundary data and then waiting idly for
the boundaries to arrive from neighboring machines, the interior points are first updated. Then, the points
along the boundary of $\Omega_h^k$ are updated after receiving the boundary points from machines k - 1 and k + 1. In
this way we hope that the time required to update the interior points will be larger than the time required for
the boundaries to arrive; otherwise we must busy wait for them.

This procedure requires that rows in the original domain $\Omega_h$ that are boundaries of partitions $\Omega_h^k$ be
represented twice. For example, the top row of $\Omega_h^k$ is updated by machine k and is also the bottom row, and
hence used as boundary data, in machine k - 1.

The decomposition of $\Omega_h$ into 'red' points and 'black' points allows machine k to work simultaneously
with the other machines. Since the 'red' points depend upon values fixed in the 'black' points, we have
already seen that all the 'red' points can be updated simultaneously. The only sharing of information is the
values of 'black' points along machine boundaries. Similarly, to update the 'black' points, the 'red' points
are held fixed, and again only the points along machine boundaries need to be exchanged.

In the case of lexicographical SOR, to update a given point, say $U_{i,j}$, points $U_{i-1,j}$ and $U_{i,j-1}$ must
be updated first. In particular, if the point $U_{i,j}$ is in machine k, all of the points in machine k - 1 must be
first updated before updating $U_{i,j}$. Similarly the points in machines k - 2, k - 3, etc. must be updated before
points in machine k - 1. This loses the effect of distributing the computational work among a number of
machines. Hence, a global decomposition of $\Omega_h$, for example the red/black decomposition, is necessary for
the Crystal implementation to succeed. In addition, since only points along machine boundaries are shared,
one complete red/black cycle requires only two message transfers.

On Crystal the red/black SOR iteration for machine l, with $N_l$ rows of unknowns, is:

Given $\omega$ and $U^0$:

Repeat for k = 1, 2, 3, ... until convergence:

1. Send boundary data to neighbors.

2. Compute new $U^{k+1}$ at interior points, rows 2, ..., $N_l - 1$.

3. Receive boundaries from machines l - 1 and l + 1.

4. Compute new $U^{k+1}$ at boundary points of $\Omega_h^l$.

5.  If  a  complete  red/black  cycle  has  been  performed,  compute  residual. 

It is important to realize that the iterates computed by this distributed form of the red/black SOR
algorithm are exactly the same as the iterates computed by the serial version of the red/black SOR algorithm.
This allows us to easily compare the savings made by using the multicomputer.
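The equivalence of the distributed and serial iterates can be checked with a small single-process simulation. The Python sketch below is an illustration under stated assumptions, not the Crystal implementation: message passing is imitated by copying rows between per-machine arrays, each local array carries one extra halo row above and below its owned rows, and the names `sweep_rows`, `serial_iteration` and `distributed_iteration` are hypothetical.

```python
def sweep_rows(u, rows, offset, n, omega, parity):
    """Relax the points of one color in the listed local rows of u, in place.

    offset converts a local row index into the global row index, so the
    color test uses global coordinates.
    """
    for r in rows:
        for j in range(1, n + 1):
            if (r + offset + j) % 2 == parity:
                avg = (u[r + 1][j] + u[r - 1][j] + u[r][j + 1] + u[r][j - 1]) / 4.0
                u[r][j] = omega * avg + (1.0 - omega) * u[r][j]


def serial_iteration(u, n, omega):
    for parity in (1, 0):  # red points first, then black points
        sweep_rows(u, range(1, n + 1), 0, n, omega, parity)


def distributed_iteration(strips, n, omega):
    """strips is a list of (first_global_row, local_array) pairs, top to bottom."""
    for parity in (1, 0):
        # 1. Every machine "sends" copies of its first and last owned rows.
        outgoing = [(list(loc[1]), list(loc[-2])) for _, loc in strips]
        # 2. Interior rows need no remote data.
        for first, loc in strips:
            sweep_rows(loc, range(2, len(loc) - 2), first - 1, n, omega, parity)
        # 3. Halo rows are filled from the neighbors' messages.
        for k, (_, loc) in enumerate(strips):
            if k > 0:
                loc[0] = list(outgoing[k - 1][1])
            if k < len(strips) - 1:
                loc[-1] = list(outgoing[k + 1][0])
        # 4. The first and last owned rows are relaxed last.
        for first, loc in strips:
            owned = len(loc) - 2
            sweep_rows(loc, sorted({1, owned}), first - 1, n, omega, parity)


n, omega = 15, 1.5
serial = [[0.0] * (n + 2) for _ in range(n + 2)]
for i in range(1, n + 1):
    for j in range(1, n + 1):
        serial[i][j] = float((3 * i + 5 * j) % 7)  # arbitrary initial guess

bounds = [(1, 5), (6, 10), (11, 15)]  # three strips of five rows each
strips = [(a, [list(serial[i]) for i in range(a - 1, b + 2)]) for a, b in bounds]

for _ in range(10):
    serial_iteration(serial, n, omega)
    distributed_iteration(strips, n, omega)

same = all(loc[r][1:n + 1] == serial[first - 1 + r][1:n + 1]
           for first, loc in strips for r in range(1, len(loc) - 1))
print(same)
```

The comparison is exact rather than approximate because both versions perform the same floating point operations on the same values; only the order in which rows are visited differs, and within one color that order does not matter.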

A number of specific details about the programming of the algorithm are of interest. To control the
various processors corresponding to each region $\Omega_h^k$ an additional processor is used. This additional
processor collects the norms of the residuals from each machine k. The total norm over $\Omega_h$ is computed and sent
to the host machine. In addition this machine provides each machine k with the necessary starting
information (number of points, value of $\omega$) and signals convergence to each machine k.

In order to keep the messages straight between machines, each message includes a synchronization
number from its sender. In addition, when the messages arrive at their destinations the originator of each
message is known. For example, machine k working on iteration l waits for messages sent from machines
k - 1 and k + 1 each labeled with synchronization number l. In the implementation of the red/black SOR
algorithm this synchronization number corresponds to the iteration count of each machine. In case messages
arrive too soon, for example from iteration l + 1, then the messages are buffered until needed.
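The buffering of early messages can be sketched with a small Python class. This is only an illustration of the idea: the class name `Mailbox`, its methods, and the use of a dictionary keyed by sender and synchronization number are assumptions, and the sketch says nothing about how the Crystal message primitives themselves work.

```python
class Mailbox:
    """Holds incoming boundary messages until the matching iteration needs them."""

    def __init__(self):
        self._pending = {}  # (sender, sync_number) -> payload

    def deliver(self, sender, sync_number, payload):
        """Called when a message arrives, possibly ahead of time."""
        self._pending[(sender, sync_number)] = payload

    def take(self, sender, sync_number):
        """Return the payload for (sender, sync_number), or None if not yet here.

        A real node would busy wait at this point instead of returning None.
        """
        return self._pending.pop((sender, sync_number), None)


box = Mailbox()
box.deliver(sender=2, sync_number=5, payload=[0.1, 0.2])  # arrives early
box.deliver(sender=2, sync_number=4, payload=[0.3, 0.4])
print(box.take(2, 4))  # [0.3, 0.4]
print(box.take(2, 5))  # [0.1, 0.2]
print(box.take(2, 6))  # None
```

Keying on both the sender and the synchronization number lets a machine pick out exactly the boundary rows belonging to its current iteration, regardless of arrival order.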


A  final  consideration  concerning  the  implementation  is  worth  noting.  Although  FORTRAN  is  available 
on  Crystal,  the  language  Wisconsin  Modula  is  used  instead.  This  choice  is  made  to  facilitate  the  buffering  of 
the messages and because of Modula's superior choice of data structures. Since we are mainly interested in
speedup  and  efficiency  for  this  model  implementation  the  consideration  of  which  language  to  use,  while 
important,  does  not  change  the  conclusions  based  on  the  experimental  results. 


4.2.  Experimental  Results 

The algorithm of the previous section was programmed and tested on the Crystal multicomputer. In
order to compare the distributed version of the algorithm to the serial (single machine) version of the
algorithm a number of definitions are required.

Let $T_p$ be the time required to run the algorithm using p machines. Then the speedup, $S_p$, is

$$S_p = \frac{T_1}{T_p}.$$

The efficiency, $E_p$, is

$$E_p = \frac{S_p}{p}.$$

For this particular distribution of the work per iteration among the p machines in use, the minimum time
required to run the algorithm is $T_1/p$, so

$$T_p \ge \frac{T_1}{p}.$$

Hence, $E_p$ satisfies

$$E_p = \frac{S_p}{p} = \frac{T_1}{p\,T_p} \le 1.$$

One hopes for $E_p$ to be as close to one as possible; however, as we will see, this is not always possible.
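As a worked example of these definitions, the following Python sketch recomputes the speedup and efficiency from average times reported for N = 63 in Table III. The function name `speedup_and_efficiency` is hypothetical; the times are copied from the table.

```python
def speedup_and_efficiency(t1, tp, p):
    """Return (S_p, E_p) from the one-machine time t1 and the p-machine time tp."""
    s = t1 / tp
    return s, s / p


# Average times for N = 63, taken from Table III.
times = {1: 441.99, 3: 160.84, 5: 105.07, 7: 77.20, 9: 63.19}
for p, tp in times.items():
    s, e = speedup_and_efficiency(times[1], tp, p)
    print(p, round(s, 2), round(e, 2))
```

The rounded values agree with the speedup and efficiency columns of Table III for these machine counts.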

To measure the time required by the algorithm, the additional node which collected the norms from
each piece timed the algorithm. By using a Crystal node machine to collect the times no allowances had to be
made for other users on the system. As an additional measure to ensure accuracy three runs were made for
every choice of N displayed in the tables. The deviation in time between runs was very small, which supports
the result that all of the measured time was due to the computations and not due to network traffic or other
extraneous factors.


Runs were made with N equal to 15, 31, 63 and 127. This corresponds to 225, 961, 3969 and 16129
unknowns respectively. This may appear to be an unreasonable number of variables; however, when using a
larger or more complicated domain $\Omega$ these are realistic sized problems. In addition to varying N, the
number of processors p equaled one, three, five, seven, nine and eleven.

Tables I through IV contain the results for N = 15, 31, 63 and 127 respectively. The column labeled
maximum number of rows shows the number of rows in the largest division of $\Omega_h$. Recall that the number
of rows per machine is not required to be the same. The machine with the largest number of rows dominates
the computation so it is important to compare the results for this machine.


Number of    Maximum number    Average
Machines     of Rows           Time       Speedup    Efficiency
    1            15              9.55       1.00
    3             5              7.28       1.31
    5             3                         1.40
    7             3              6.73       1.42
    9             2              6.28       1.52
   11             2              6.16       1.55

Table I - N = 15


Number of    Maximum number    Average
Machines     of Rows           Time       Speedup    Efficiency
    1            31             57.41
    3                           26.51       2.17       0.72
    5                           19.49       2.95       0.59
    7                                       3.57       0.51
    9                           14.33                  0.45
   11                           12.08       4.53       0.41

Table II - N = 31


Number of    Maximum number    Average
Machines     of Rows           Time       Speedup    Efficiency
    1            63            441.99       1.00       1.00
    3            21            160.84       2.75       0.92
    5            13            105.07       4.21       0.84
    7             9             77.20       5.73       0.82
    9             7             63.19       6.99       0.78
   11             6             56.34       7.85       0.71

Table III - N = 63


Number of    Maximum number    Average
Machines     of Rows           Time       Speedup    Efficiency

Table IV - N = 127

Figure 2 contains a plot of the speedup versus number of machines, while Figure 3 displays the
efficiency versus number of machines. As can readily be seen, for large problems the results are very encouraging.
Indeed for N = 127 the algorithm remains over 85% efficient. This indicates that the message transfer
time is successfully dominated by the time required for computations. However, for small problems the
efficiency rapidly drops off, which indicates that this form of distributing the algorithm is not worthwhile for
small problems.


5.  The  Multigrid  Method 


The multigrid algorithm for the solution of (2.1) has been studied by many authors, see [8, 9, 10, 11].
Multigrid is a term used to describe an iterative technique which uses auxiliary grids which usually have
significantly fewer points than the original grid. We will not attempt to describe all multigrid algorithms here,
but rather will describe the one algorithm implemented on the Crystal multicomputer.

The algorithm chosen for implementation is known as the MGR[v] algorithm. This algorithm was first
described by Braess [12] (algorithm 2.1 in his paper) who analyzed the two grid version of what became
known as MGR[0] for the Poisson equation in a general polygonal domain. Ries, Trottenberg and Winter
[13] later analyzed the algorithm for the Poisson equation in a square for arbitrary v. Their results agree
with the result of Braess for the case v = 0. Kamowitz and Parter [14] extended the previous results for two
grids for the MGR[0] case to the variable coefficient diffusion equation in a general polygonal domain. And
finally Parter [15] extended the results of Ries, Trottenberg and Winter and of Braess. He proved that the
three grid rate of convergence in a general polygonal domain for MGR[0] for the variable coefficient
diffusion equation is the same as the rate, $\rho \le \tfrac{1}{2} + O(h)$, for the two grid algorithm.

5.1.  The  MGR[v]  Algorithm 

In  order  to  completely  describe  the  MGR[v]  algorithm  a  number  of  spaces,  operators  and  parameters 
need  to  be  defined.  In  brief,  each  multigrid  iteration  consists  of  a  small  number  of  smoothing  iterations,  the 
transfer  of  the  residual  to  a  coarser  grid,  the  solution  of  a  related  system  of  equations  to  compute  the  “coarse 
grid  correction”  and  the  updating  of  the  smoothed  values  in  the  original,  fine,  grid  by  interpolating  the 
coarse  grid  correction  to  the  fine  space.  It  should  be  noted  that  the  multigrid  algorithm  itself  can  be  used 
recursively  to  compute  the  coarse  grid  correction;  this  leads  to  a  true  multigrid  algorithm.  Also,  additional 
smoothing  steps  can  be  done  to  the  coarse  grid  correction  while  interpolating  to  finer  grids. 

First the general MGR[v] multigrid algorithm will be presented, then the details concerning each stage
of the algorithm will be described. In terms of the implementation on Crystal only a rudimentary
understanding of the algorithm is necessary. However the details are important in terms of the actual performance on
Crystal.


The MGR[v] algorithm uses t nested grids, where t is selected in advance of running the algorithm.
The nested sequence of grids is labeled $\Omega_1 \supset \Omega_2 \supset \cdots \supset \Omega_t$ where $\Omega_1$ corresponds to $\Omega_h$ of section 2.
Associated with each grid $\Omega_k$ is a positive definite, symmetric operator

$$L_k : \Omega_k \to \Omega_k.$$

To solve $L_1 U_1 = f_1$, where $L_1$ is the linear operator defined in equation (2.2) and $f_1$ is 0 for our
particular problem, the following algorithm is used.

Set k := 1, $U_1$ := initial guess.

Algorithm MG($L_k$, $U_k$, $f_k$, k):

( $L_k$ given positive definite symmetric operator,
$U_k$ given initial guess, returns value at next iteration,
$f_k$ right hand side,
k grid layer )

(1) Smooth: Perform v iterations of odd/even Gauss-Seidel relaxation on the problem $L_k U_k = f_k$ followed
by one odd sweep. Store the results of this step in $\tilde{U}_k$.

(2) Compute the residual $r_k$:

Set $r_k := f_k - L_k \tilde{U}_k$.

Note that at the odd points $r_k = 0$.

(3) Restrict the residual $r_k$ to $\Omega_{k+1}$:

Set $f_{k+1} := I_k^{k+1} r_k$.

(4) Consider computing the coarse grid correction:

Find $U_{k+1}$ such that $L_{k+1} U_{k+1} = f_{k+1}$.

If k + 1 = t solve directly (i.e. return $U_t = L_t^{-1} f_t$).

Otherwise, set $U_{k+1} := 0$ and return MG($L_{k+1}$, $U_{k+1}$, $f_{k+1}$, k + 1).

(5) Interpolate and update:

Set $U_k := \tilde{U}_k + I_{k+1}^k U_{k+1}$.

(6) Return $U_k$, exit algorithm.


In the above description of algorithm MG($L_k$, $U_k$, $f_k$, k) the details of the operators $L_k$, $I_k^{k+1}$ and $I_{k+1}^k$
were deliberately left out. Indeed, with the exception of $\Omega_1$ which corresponds to $\Omega_h$, the spaces $\Omega_k$,
k = 2, 3, ..., t have not yet been defined.

For  clarity  only  the  particulars  for  the  two  grid  algorithm  will  be  described  fully.  The  details  for  the 
full  t  grid  algorithm  extend  readily  from  the  two  grid  description. 

5.1.1. Additional Details of the MGR[v] Algorithm

Coarse  Grid  Spaces 

Given $N_1$, recall the definition of $\Omega_1$:

$$\Omega_1 = \Omega_h = \{ (x_i, y_j) \in \Omega \mid x_i = ih,\ y_j = jh,\ 1 \le i, j \le N_1 \}.$$

Then the coarse grid spaces $\Omega_l$, l = 3, 5, ... correspond to setting

$$N_{new} := \tfrac{1}{2}(N_1 + 1) - 1$$

and computing $\Omega_h$ with h now equal to $1/(N_{new} + 1)$. The coarse grid spaces $\Omega_l$, l = 2, 4, ... correspond to
the "black" points of $\Omega_{l-1}$.

Communication between Spaces

The interpolation operator $I_{k+1}^k$ is constructed as follows:

$$(I_{k+1}^k U)_{i,j} := U_{i,j} \quad \text{if the point } (x_i, y_j) \in \Omega_{k+1}$$

and for $(x_i, y_j) \in \Omega_k \setminus \Omega_{k+1}$ we require

$$\left( L_k I_{k+1}^k U \right)_{i,j} = 0. \tag{5.1}$$

Note that (5.1) results in an explicit equation for each point $(x_i, y_j) \in \Omega_k \setminus \Omega_{k+1}$.

For the restriction operator $I_k^{k+1}$ we set

$$I_k^{k+1} := \tfrac{1}{2} \left( I_{k+1}^k \right)^T. \tag{5.2}$$

In step 3 of Algorithm MG applying the operator $I_k^{k+1}$ to the residual reduces to dividing the residual by 2 on
points of $\Omega_k \cap \Omega_{k+1}$.

Coarse Grid Operators

As is well known [11] the "ideal" choice for $L_{k+1}$ is

$$L_{k+1} := I_k^{k+1} L_k I_{k+1}^k.$$

With this choice of $L_{k+1}$ for the coarse grid operator the two grid MGR[v] algorithm converges in one step!
However, $L_{k+1}$ is a nine point operator as a straightforward calculation shows. In order to continue doing
odd/even relaxation on each of the coarser grid layers we require $L_{k+1}$ to be a five point operator.
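The one step convergence with the ideal coarse grid operator can be checked numerically on a one dimensional analogue, which keeps the example short. The Python sketch below is not the two dimensional MGR[v] algorithm: it assumes the operator tridiag(-1, 2, -1) on seven points, takes v = 0 (a single odd sweep), keeps every second point on the coarse grid, and builds the interpolation, restriction and coarse operator in the manner of (5.1), (5.2) and the product $I_k^{k+1} L_k I_{k+1}^k$. All names are hypothetical, and in one dimension the coarse operator is a three point rather than a nine point operator.

```python
def matvec(a, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def solve(a, b):
    """Gaussian elimination without pivoting (adequate for the matrices used here)."""
    n = len(b)
    a = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        for r in range(c + 1, n):
            factor = a[r][c] / a[c][c]
            for k in range(c, n + 1):
                a[r][k] -= factor * a[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = a[r][n] - sum(a[r][k] * x[k] for k in range(r + 1, n))
        x[r] = s / a[r][r]
    return x


n = 7                      # fine grid points, 0-based indices 0..6
fine = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
         for j in range(n)] for i in range(n)]   # 1D stand-in for L_k
coarse_pts = [1, 3, 5]     # points kept on the coarse grid
odd_pts = [0, 2, 4, 6]     # points relaxed by the odd sweep
nc = len(coarse_pts)

# Interpolation: identity at coarse points; at the other points the
# requirement (L_k P v)_i = 0 gives the average of the two neighbors.
p = [[0.0] * nc for _ in range(n)]
for c, i in enumerate(coarse_pts):
    p[i][c] = 1.0
    p[i - 1][c] += 0.5
    p[i + 1][c] += 0.5

r_op = [[0.5 * p[i][c] for i in range(n)] for c in range(nc)]   # half the transpose
lp = [[sum(fine[i][k] * p[k][c] for k in range(n)) for c in range(nc)] for i in range(n)]
coarse = [[sum(r_op[a][i] * lp[i][c] for i in range(n)) for c in range(nc)]
          for a in range(nc)]

f = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5, -1.0]   # arbitrary right hand side
exact = solve(fine, f)

u = [0.0] * n                                # initial guess
for i in odd_pts:                            # step 1 with v = 0: one odd sweep
    left = u[i - 1] if i > 0 else 0.0
    right = u[i + 1] if i < n - 1 else 0.0
    u[i] = (f[i] + left + right) / 2.0
res = [fi - ai for fi, ai in zip(f, matvec(fine, u))]   # step 2
f_c = matvec(r_op, res)                                  # step 3
u_c = solve(coarse, f_c)                                 # step 4
u = [ui + ci for ui, ci in zip(u, matvec(p, u_c))]       # step 5
error = max(abs(a - b) for a, b in zip(u, exact))
print(error)   # at the level of rounding error
```

After the odd sweep the residual vanishes at the odd points, so the error is exactly the interpolant of its own coarse grid values, and the coarse solve with the product operator recovers those values; this is why a single cycle suffices in this setting.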

More specifically, the stencil for the ideal $L_{k+1}$ (k odd) is of the form

[Figure: stencil of the ideal $L_{k+1}$]

For the MGR[v] algorithm we take $L_{k+1}$ to be the "nearest" five points in the stencil for the ideal $L_{k+1}$. E.g., the
stencil for $L_{k+1}$ (k odd) is

[Figure: five point stencil of $L_{k+1}$ connecting a point to its four diagonal neighbors]

For k even, $L_{k+1}$ corresponds to $L_{\bar h}$, where $\bar h = 2^{k/2} h$. It should be noted that this choice of $L_{k+1}$ for even
numbered grid layers results in the rotated grids characteristic of the MGR[v] algorithm.


5.2. Implementing the MGR[v] Algorithm on Crystal

The MGR[v] algorithm of Section 5.1 was implemented on the Crystal multicomputer. Each grid
$\Omega_k$, k = 1, 2, ..., t is partitioned among p processors, just as for the red/black SOR algorithm of section 3.
Steps 1-3 and 5 of algorithm MG are local steps, while step 4 is a global step. By a local step we mean a step
of the algorithm where updating any particular point requires only the values of the four nearest neighboring
points. The coarse solve on $\Omega_t$, in step 4, is a global step since the values of the unknowns throughout all of
$\Omega_t$ are required before the coarse grid correction can be computed. The purpose of this experimental study
on the Crystal multicomputer is to determine whether there is any combination of number of grid layers and
number of smoothing iterations (v) for which the distributed portions of the algorithm effectively mask the
deleterious effect of the global solution step.

As the number of grid layers increases, one hopes that the effect of the global solution step on the
efficiency of the distributed algorithm should decrease. However, as we shall see, as the number of grid layers
increases, the amount of work per grid layer decreases across all machines, and eventually there is a drop off
in the efficiency of distributing the local steps across all the machines. For example, if the coarsest grid, $\Omega_t$,
has only one point on it, then the global solution step can be accomplished in one arithmetic operation.
Unfortunately, the local operations on $\Omega_t$ are divided among the p processors in use, which means in this
case that p - 1 processors are idle. The tables and graphs following this section illustrate this effect.
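The loss of parallelism on coarse layers can be made concrete with a short Python sketch. It considers only the odd numbered (unrotated) grid layers, assumes that each of them is split into p horizontal strips by the same rule as the finest grid, and assumes that the number of points per side goes from N to (N + 1)/2 - 1 between such layers. The function name `busy_machines` is hypothetical.

```python
def busy_machines(n, p):
    """Number of strips (out of p) that contain at least one of the n grid rows."""
    return len({(j * p) // (n + 1) for j in range(1, n + 1)})


n, p = 63, 11
while n >= 1:
    print(n, busy_machines(n, p), "of", p, "machines have rows")
    n = (n + 1) // 2 - 1
```

With p = 11 every machine owns at least one row down to N = 15; at N = 7 only seven machines have work, and at N = 1 a single machine does, which is the idle processor effect described above.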

The details of the implementation of the local steps (smoothing, computing the residual, restricting to
coarser grids and interpolating to finer grids) are similar to the implementation of the red/black SOR
algorithm. That is: send boundary data, update interior sections, receive boundaries from neighbors and then
update the boundary values. These local steps are repeated for each of the l grid layers in use.
Unfortunately, as the number of grid layers increases, the number of points per grid layer decreases, and
eventually the communication time for each step of the algorithm on these coarser grids dominates the
computational time.
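The send / update interior / receive / update boundary pattern can be sketched as follows. This is a minimal single-process Python simulation, not the Crystal code: the strip layout by rows, the function names (`split_rows`, `make_strips`, `update_row`, `local_sweep`), the use of list copies in place of messages, and the choice of plain Gauss-Seidel (relaxation parameter 1) with zero boundary values are all assumptions made for illustration. The right-hand side `rhs` is assumed to be already scaled by h squared.

```python
def split_rows(n, p):
    """Split n grid rows into p contiguous strips of near-equal size (assumes p <= n)."""
    base, extra = divmod(n, p)
    ranges, lo = [], 0
    for k in range(p):
        hi = lo + base + (1 if k < extra else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges


def make_strips(u, ranges):
    """Copy each strip's rows out of the global grid u, adding zero ghost rows and columns."""
    width = len(u[0]) + 2
    return [[[0.0] * width]
            + [[0.0] + list(u[g]) + [0.0] for g in range(lo, hi)]
            + [[0.0] * width]
            for lo, hi in ranges]


def update_row(strip, lo, r, rhs, color):
    """Gauss-Seidel update of the points of one colour in local row r of a strip."""
    g = lo + r - 1                      # global row index of local row r
    row = strip[r]
    for j in range(1, len(row) - 1):
        if (g + j - 1) % 2 == color:    # colour = parity of (global row + global column)
            row[j] = 0.25 * (strip[r - 1][j] + strip[r + 1][j]
                             + row[j - 1] + row[j + 1] + rhs[g][j - 1])


def local_sweep(strips, ranges, rhs, color):
    """One half-sweep (one colour) over all strips, in the order described in the text."""
    # 1. "Send": every strip posts copies of its first and last owned rows.
    outbox = [(s[1][:], s[-2][:]) for s in strips]
    # 2. Update the interior rows, which need no data from neighbours.
    for s, (lo, _) in zip(strips, ranges):
        for r in range(2, len(s) - 2):
            update_row(s, lo, r, rhs, color)
    # 3. "Receive": fill the ghost rows from the neighbouring strips' posted rows.
    for k, s in enumerate(strips):
        if k > 0:
            s[0] = outbox[k - 1][1]
        if k < len(strips) - 1:
            s[-1] = outbox[k + 1][0]
    # 4. Update the rows that border a neighbouring strip.
    for s, (lo, _) in zip(strips, ranges):
        for r in sorted({1, len(s) - 2}):
            update_row(s, lo, r, rhs, color)
```

Calling `local_sweep` once with colour 0 and once with colour 1 gives the same values as one red/black sweep over the undivided grid, because a point of one colour only reads neighbours of the other colour. On a coarse grid each strip holds only a row or two, so steps 1 and 3 (communication) outweigh steps 2 and 4 (arithmetic), which is the effect described above.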

In addition to the local steps of the algorithm, the coarse grid correction of step 4, on the coarsest grid,
requires knowledge of the remaining unknowns in all of Ω_l. These unknowns are distributed among all p
processors in use. For this step each machine sends its unknowns to an additional, dedicated node. This
master node collects the unknowns from each of the slave nodes. These slave nodes correspond to the nodes
used by the red/black SOR algorithm and perform the local operations. Once each slave node has sent its
unknowns to this dedicated machine, the coarse grid correction is computed using Gaussian elimination. The
results of the coarse grid correction computed by the master node are then redistributed to each slave machine
and the algorithm continues.
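The gather, solve and redistribute sequence of the global step can be sketched as follows. This is an illustrative Python sketch only: the names `gaussian_solve`, `five_point_matrix` and `coarse_solve_on_master` are hypothetical, the standard 5-point Laplacian is used as a stand-in for the actual coarse grid operator of the algorithm, and the "messages" are ordinary lists.

```python
def gaussian_solve(a, b):
    """Solve a x = b by dense Gaussian elimination with partial pivoting."""
    n = len(b)
    a = [row[:] for row in a]          # work on copies
    b = b[:]
    for k in range(n):
        piv = max(range(k, n), key=lambda i: abs(a[i][k]))
        a[k], a[piv] = a[piv], a[k]
        b[k], b[piv] = b[piv], b[k]
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= m * a[k][j]
            b[i] -= m * b[k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):     # back substitution
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x


def five_point_matrix(m):
    """Dense 5-point Laplacian on an m x m grid (assumed stand-in operator)."""
    n = m * m
    a = [[0.0] * n for _ in range(n)]
    for i in range(m):
        for j in range(m):
            k = i * m + j
            a[k][k] = 4.0
            if i > 0:
                a[k][k - m] = -1.0
            if i < m - 1:
                a[k][k + m] = -1.0
            if j > 0:
                a[k][k - 1] = -1.0
            if j < m - 1:
                a[k][k + 1] = -1.0
    return a


def coarse_solve_on_master(slave_rows):
    """slave_rows[k] holds the coarse-grid right-hand-side rows owned by slave k."""
    # Gather: the master collects every slave's rows, in strip order.
    gathered = [row for rows in slave_rows for row in rows]
    m = len(gathered)
    rhs = [v for row in gathered for v in row]
    # Solve: direct solution on the master while the slaves wait.
    x = gaussian_solve(five_point_matrix(m), rhs)
    # Redistribute: hand each slave the solution rows it owns.
    pieces, start = [], 0
    for rows in slave_rows:
        pieces.append([x[(start + r) * m:(start + r + 1) * m] for r in range(len(rows))])
        start += len(rows)
    return pieces
```

Everything between the gather and the redistribution runs on one machine, which is why the slaves sit idle for the duration of the solve.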

Of course, a more easily parallelizable algorithm, such as red/black SOR, could have been used to solve
the coarse grid problem. This would have increased the observed efficiency in the experimental study.
However, the cost would have been a slower running algorithm, since for small problems (such as the coarse
grid correction when using a fair number of grids) Gaussian elimination is faster. The question of what
technique to use to solve the coarse grid equation requires further study.

The coarse grid correction step is the most serious bottleneck of the distributed version of the
algorithm. During this step each slave node must remain idle while the coarse grid correction is computed. If the
number of points in Ω_l is large (e.g. if l = 2 or 3), then this step dominates the computational
time of the algorithm. However, for the special case when the coarsest grid contains only one point, this
transfer of unknowns to the master node is eliminated. In this special case the one node containing the
coarse grid (with only one point in it) computes the coarse grid correction itself.

In addition to the bottleneck resulting from the coarse grid correction, there is another load balancing
problem inherent in implementing the MGR[v] algorithm on Crystal. As the grids get coarser, eventually
there are more machines in use than there are rows in the coarser grids. As a result,
some of the machines have no work to do and hence must sit idle for some portion of the algorithm. From a
practical programming point of view this adds the complication of keeping the machines synchronized
throughout the iteration.
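The load imbalance on the coarser grids can be made concrete with a small count. The sketch below is an illustration under stated assumptions: grids are partitioned by rows, a machine needs at least one row to have work, and the row count drops from n to (n - 1)/2 from one standard grid to the next. The intermediate grids of the MGR hierarchy are ignored, and the names `halved_rows` and `idle_machines` are hypothetical.

```python
def halved_rows(n):
    """Row counts n, (n-1)/2, ... down to 1 (assumed standard coarsening, n = 2^k - 1)."""
    rows = []
    while n >= 1:
        rows.append(n)
        n = (n - 1) // 2
    return rows


def idle_machines(rows_per_grid, p):
    """For each grid, the number of the p machines that own no row of that grid."""
    return [max(0, p - rows) for rows in rows_per_grid]


# Example: N = 15 with p = 11 machines.
print(halved_rows(15))                     # [15, 7, 3, 1]
print(idle_machines(halved_rows(15), 11))  # [0, 4, 8, 10]
```

With a one-point coarsest grid the last entry is p - 1, matching the observation above that all but one processor is idle on that grid.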

5.3.  Experimental  Results 

The MGR[v] algorithm as previously described was implemented and tested on the Crystal
multicomputer. Tests were made with N = 15, 31, 63 and 127 and with v equal to 0, 1 and 2. The number of
processors p used to perform steps 1-3 and 5 of algorithm MG was equal to 1, 3, 5, 7, 9 and 11. With the
exception of the case p = 1 and the case where Ω_l has one point in it, one additional processor was used to
compute the coarse grid correction. Thus, the total number of processors used when Ω_l had more than one
point in it equaled 1, 4, 6, 8, 10 and 12. When Ω_l had only one point, the number of processors used
equaled 1, 3, 5, 7, 9 and 11. Unfortunately, due to physical constraints on the amount of memory in the
node machines, some combinations of the above parameters could not be tested; these cases are noted in
the tables found in the appendix. Also, the single machine tests with N = 127 were run on a lightly loaded
VAX 750 (the same type of VAX as a node machine) running UNIX. By way of comparison a few runs
were made with N = 63 on both the VAX running UNIX and on a node machine. The times from both
machines agreed to within a few seconds.

The purpose of our experimental study on the Crystal architecture is to determine what the optimal
choice of p and v is, if indeed there is one, for a problem of a particular size. To gain insight into this question
it is worthwhile to look at both the observed rate of convergence and the distribution of computational work
between the easily distributed steps of the algorithm and the coarse solve step.

The appendix contains the full set of observed CPU times and efficiencies for all the test problems. For
expository simplicity only the case N = 63 will be discussed in this section. This case covers the full range
of the parameter v and of the number of grids and is representative of the other problem sizes.

Figure 1 displays the observed rate of convergence for N = 63 and for v = 0, 1 and 2. Note that for 2
and 3 grids the observed rate of convergence is indeed bounded above by the predicted asymptotic rate of
convergence ρ(v) of the two grid MGR[v] algorithm given in the Introduction.

However, for v = 1 and 2 there is very little change in the rate of convergence as the number of grids
increases. This is in some ways counter-intuitive and requires further theoretical investigation.
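The comparison between observed and predicted rates can be checked numerically. The sketch below rests on an assumption: it takes the predicted two grid rate to be the asymptotic value (1/2)(2v)^(2v)/(2v+1)^(2v+1) quoted in the MGR literature, which may not match the exact form intended in the Introduction. The observed rates are the 2 and 3 grid entries of Table 3 (N = 63), and the function name is hypothetical.

```python
def predicted_two_grid_rate(v):
    """Assumed asymptotic (h -> 0) two grid rate: (1/2) (2v)^(2v) / (2v+1)^(2v+1)."""
    return 0.5 * (2 * v) ** (2 * v) / (2 * v + 1) ** (2 * v + 1)


# Observed rates from Table 3 (N = 63): {v: (2 grids, 3 grids)}.
observed = {0: (0.4583, 0.4586), 1: (0.06711, 0.06708), 2: (0.03656, 0.03660)}

for v, rates in observed.items():
    bound = predicted_two_grid_rate(v)
    print(v, round(bound, 5), all(r <= bound for r in rates))
```

Under this assumption the bounds are 1/2, 2/27 and 128/3125 for v = 0, 1 and 2, and each observed 2 and 3 grid rate lies below the corresponding bound.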

After observing the rate of convergence for each test case, a crude count of the computational work of
the algorithm was made. Since only a rough estimate of the work is of interest, 1 “work unit” was assigned
to each unknown at each step of the algorithm. For example, with n points on a grid, n “work units” were
counted during the smoothing step rather than 5n floating point operations, which is formally more correct for
one iteration of Gauss-Seidel smoothing. The computational work required for convergence for each
problem is proportional to the size of the problem.

Figure 2 displays a graph of the total computational work for N = 63 and figure 3 displays a graph of
the ratio of work for the coarse solve step to the total computational work. Notice that the amount of work
being done in the coarse solve step falls off rapidly.

Figures 4-6 display graphs of the observed efficiency for v = 0, 1 and 2. Each line corresponds to a
different number of grid layers. The bottom line, representing the least efficient case, displays the observed
efficiency for 2 grids, while the bold line displays the observed efficiency for the special case where Ω_l
contains only one point. Recall that in this case the master node does no work, so the number of processors used
is simply the number used for steps 1-3 and 5 of the algorithm.

As the number of grids used increases, the efficiency curves coalesce. This is not surprising since, beyond
a small number of grid layers, there is not much difference in the amount of work being performed with
respect to the number of grids used (see figure 2).

Unfortunately, from a distributed algorithm point of view, as the number of machines increases the
efficiency drops off steadily. This particular implementation of the MGR[v] algorithm is caught in the bind of
either having too much work to do solving the coarse grid equations or having too little work to do on the
coarser grids while smoothing, computing the residual, etc. In addition, having to wait idly for the coarse
grid equations to be solved is another factor limiting the efficiency of the algorithm.
In the special case where Ω_l has one point, this problem is somewhat alleviated, as can be seen in figures
4-6. Yet, as the number of machines increases, even in this case the amount of work remaining to be
distributed among the processors is too small to use them all efficiently.


Figure 1

6. Concluding remarks


Two  approaches  were  presented  for  solving  problem  (2.2).  The  first  approach,  red/black  SOR,  was 
easy  to  implement  on  the  Crystal  architecture  and  the  experimental  results  in  terms  of  the  observed  efficiency 
for  this  algorithm  were  very  encouraging. 

The  second  approach,  the  MGR[v]  algorithm,  was  much  more  difficult  to  implement  on  the  Crystal 
architecture.  Alas,  this  particular  implementation  did  not  succeed  in  terms  of  high  efficiency.  Indeed  this 
lack  of  high  efficiency  appears  to  be  inherent  in  the  algorithm  itself. 

There is one important saving grace in the MGR[v] algorithm. While it might not lend itself to a
distributed implementation, even the serial version is much faster than the red/black SOR algorithm. With v = 1,
and the appropriate number of grids (depending upon N), the MGR[v] algorithm was up to seventeen times
faster than the serial version of the red/black SOR algorithm. In practical terms, this means that for a 100%
efficient parallel implementation of the red/black SOR algorithm to be competitive with the MGR[v] algorithm,
at least seventeen machines must be used for every one machine used for the MGR[v] algorithm.
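The arithmetic behind this statement can be written out as a short sketch. If the serial MGR[v] run is R times faster than serial SOR, and SOR runs on P machines with efficiency E, then the parallel SOR time is T_SOR / (E P), so SOR only matches MGR[v] when E P is at least R. The function name and the efficiency value 0.5 below are hypothetical; only the ratio 17 comes from the experiments.

```python
import math


def sor_machines_needed(serial_speed_ratio, efficiency=1.0):
    """Smallest machine count P with efficiency * P >= serial_speed_ratio."""
    return math.ceil(serial_speed_ratio / efficiency)


print(sor_machines_needed(17))        # 17 machines at 100% efficiency
print(sor_machines_needed(17, 0.5))   # 34 machines at a hypothetical 50% efficiency
```

At anything less than perfect efficiency the required machine count grows in proportion to 1/E.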

6.1.  Suggestions  for  further  work. 

A number of questions for further research have been opened by this study. The first question is what
happens if asynchronous smoothing is used? How much degradation in the rate of convergence of the
red/black SOR and the MGR[v] algorithms, if any, will there be? What kind of theoretical convergence
results can be expected? This approach has been looked at by a number of people, see [16, 17]; however no
clear answer has emerged.

Another question, whose answer would perhaps improve the efficiency of the MGR[v] algorithm, is: Is there some
way to work on more than one grid layer at a time? For instance, perhaps by staggering the iterations among
the grid levels each machine could work on a different grid layer, or perhaps even on more than one level.
This idea has been investigated by Greenbaum [18]. Alternatively, is there some way for some of the idle
machines to perform useful work while waiting for the solution of the coarse grid equations?

Appendix


Tables 1-4 display the observed rate of convergence and the number of iterations required for each test
problem. Tables 5-8, a-c, contain the observed CPU time for each test problem: 5a corresponds to N
= 15, v = 0, 5b corresponds to N = 15, v = 1, etc. The efficiency for each test case is displayed in tables
9-12, a-c. Finally, table 13 contains the observed efficiency for the special case where Ω_l contains one point.
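The relation between the time tables and the efficiency tables can be illustrated with a short calculation. The sketch assumes that the efficiency (EP) is the usual ratio T_1 / (P T_P), where T_1 is the one machine time, T_P the time on P machines and P the total machine count including the master node. This definition is not restated in this section, and the function name is hypothetical. The times are the 2 grid row of Table 5a (N = 15, v = 0).

```python
def efficiency(t_serial, t_parallel, machines):
    """Assumed parallel efficiency: serial time / (machine count * parallel time)."""
    return t_serial / (machines * t_parallel)


t1 = 6.37  # Table 5a, 2 grids, 1 machine
# (total machines including the master node, observed time from Table 5a)
runs = [(6, 4.63), (8, 4.75), (12, 4.82)]

for machines, t in runs:
    print(machines, round(efficiency(t1, t, machines), 2))
```

Under this assumption the three values round to .23, .17 and .11, which agree with the corresponding 2 grid entries of Table 9a.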


Observed rate of convergence / Number of iterations

Number
of grids      v = 0           v = 1           v = 2
    2       .4628 / 10      .06154 / 3      .03228 / 3
    3       .4639 / 10      .06221 / 3      .03277 / 3
    4       .5964 / 13      .06396 / 4      .03357 / 3
    5       .5984 / 12      .06133 / 4      .03402 / 3
    6       .6633 / 15      .06133 / 4      .03409 / 3
    7       .6474 / 13      .06086 / 4      .03412 / 3

Table 1 - N = 15


Observed rate of convergence / Number of iterations

Number
of grids      v = 0           v = 1           v = 2
    2       .4585 / 9       .06582 / 3      .03546 / 3
    3       .4591 / 9       .06583 / 3      .03564 / 3
    4       .5909 / 12      .06667 / 3      .03641 / 3
    5       .5931 / 12      .06585 / 3      .03645 / 3
    6       .6950 / 15      .06513 / 3      .03653 / 3
    7       .6971 / 15      .06483 / 3      .03685 / 3
    8       .7457 / 18      .06483 / 3      .03685 / 3
    9       .7308 / 16      .06493 / 3      .03688 / 3

Table 2 - N = 31


Observed rate of convergence / Number of iterations

Number
of grids      v = 0           v = 1           v = 2
    2       .4583 / 9       .06711 / 3      .03656 / 3
    3       .4586 / 9       .06708 / 3      .03660 / 3
    4       .5879 / 11      .06791 / 3      .03741 / 3
    5       .5889 / 11      .06757 / 3      .03743 / 3
    6       .6912 / 14      .06661 / 3      .03750 / 3
    7       .6920 / 14      .06649 / 3      .03757 / 3
    8       .7619 / 17      .06647 / 3      .03762 / 3
    9       .7590 / 16      .06648 / 3      .03769 / 3
   10       .7980 / 19      .06648 / 3      .03769 / 3
   11       .7798 / 17      .06651 / 3      .03770 / 3

Table 3 - N = 63


Observed rate of convergence / Number of iterations

Number
of grids      v = 0           v = 1           v = 2
    2       na              na              na
    3       na              na              na
    4       .5879 / 11      .06844 / 3      .03781 / 3
    5       .5884 / 11      .06828 / 3      .03782 / 3
    6       .6887 / 13      .06723 / 3      .03790 / 3
    7       .6890 / 13      .06718 / 3      .03793 / 3
    8       .7590 / 16      .06712 / 3      .03799 / 3
    9       .7587 / 16      .06713 / 3      .03801 / 3
   10       .8049 / 18      .06716 / 3      .03803 / 3
   11       .7993 / 17      .06717 / 3      .03805 / 3
   12       .8301 / 17      .06717 / 3      .03805 / 3
   13       .8123 / 18      .06718 / 3      .03805 / 3

Table 4 - N = 127

na means that this particular run could not be performed, usually due to size constraints.


Observed solution time - N = 15
v = 0

Number           1         3         5         7         9        11
of Grids   machine  machines  machines  machines  machines  machines
    2         6.37      5.08      4.63      4.75      4.78      4.82
    3         7.09      5.99      4.55      4.52      4.65      4.32
    4         9.41      7.26      6.57      6.54      6.48      6.30
    5         9.15      7.79      6.59      6.61      6.42      6.22
    6        11.64     10.76      9.58      9.55      9.26      9.14
    7        10.29      9.88      8.85      8.78      8.51      8.47

Table 5a


Observed solution time - N = 15
v = 1

Number           1         3         5         7         9        11
of Grids   machine  machines  machines  machines  machines  machines
    2         2.69      2.10      1.97      2.03      2.07      2.10
    3         2.89      2.07      1.88      1.87      1.89      1.79
    4         3.85      3.19      2.89      2.87      2.87      2.81
    5         4.05      3.75      2.90      3.24      3.20      3.25
    6         4.13      4.26      3.85      3.80      3.79      3.72
    7         4.21      4.60      4.19      4.14      4.11      4.10

Table 5b


Observed solution time - N = 15
v = 2

Each row gives the number of grids, then the surviving times in order of increasing number of machines
(1, 3, 5, 7, 9 and 11):

2:  2.26  2.16  2.19  2.21  2.24
3:  3.43  2.55  2.31  2.27  2.26
4:  3.54  3.15  2.81  2.76  2.76  2.71
5:  3.74  3.66  2.81  3.19  3.15
6:  3.82  4.19  3.84  3.79  3.78  3.72
7:  3.89  4.59  4.22  4.22  4.16  4.14

Observed solution time - N = 63
v = 2

Number           1         3         5         7         9        11
of Grids   machine  machines  machines  machines  machines  machines
    2        167.7        na     142.0     140.5     140.0     141.4
    3        111.5      76.3      71.4      69.0      67.9      68.3
    4         67.1      27.7      22.4      19.8      18.4      17.9
    5                   25.4      17.9      14.3      12.8      12.2
    6                   25.3      17.8      13.9      11.9      10.9
    7         63.4      25.9      17.6      14.3      12.4      11.3
    8         63.5      26.4      18.0      14.9      12.9      11.9
    9         63.7                18.5      15.3      13.2      12.3
   10         63.8      27.5      19.3      15.8      13.8      12.8
   11                   27.9      19.9      16.3      14.2      13.5

Table 7c


Observed solution time - N = 127
v = 0

EP - N = 15
v = 0

Number           1         4         6         8        10        12
of Grids   machine  machines  machines  machines  machines  machines
    2          1.0       .32       .23       .17       .14       .11
    3          1.0       .29       .26       .19       .15       .14
    4          1.0       .32       .24       .18       .14       .13
    5          1.0       .29       .23       .18       .14       .12
    6          1.0       .27                 .15       .13       .11

Table 9a


EP - N = 15
v = 1

Number           1         4         6         8        10        12
of Grids   machine  machines  machines  machines  machines  machines
    2                    .32       .23       .17       .13       .11
    3                    .35       .26       .19       .15       .14
    4                    .30       .23       .17       .14       .11
    5                    .27       .23       .16       .13       .10
    6                    .24       .18       .14       .11       .09

Table 9b


.18  .14  .12
.28  .23  .19
.37  .32  .28
.34  .31  .27
.32  .28  .25
.30  .26  .23
.27  .23  .21